Abstract

Information has an entropic character which can be analyzed within the framework of the Statistical 
Theory in molecular systems. R. Landauer and C.H. Bennett showed that a logical copy can be car- 
ried out in the limit of no dissipation if the computation is performed sufficiently slowly. Structural 
and recent single-molecule assays have provided dynamic details of polymerase machinery with insight 
into information processing. Here, we introduce a rigorous characterization of Shannon Information in
biomolecular systems and apply it to DNA replication in the limit of no dissipation. Specifically, we
devise an equilibrium pathway in DNA replication to determine the entropy generated in copying the 
information from a DNA template in the absence of friction. Both the initial state, the free nucleotides 
randomly distributed in certain concentrations, and the final state, a polymerized strand, are mesoscopic 
equilibrium states for the nucleotide distribution. We use empirical stacking free energies to calculate the
probabilities of incorporation of the nucleotides. The copied strand is, to first order of approximation,
a state of independent and non-identically distributed random variables for which the nucleotide that
is incorporated by the polymerase at each step is dictated by the template strand, and to second order 
of approximation, a state of non-uniformly distributed random variables with nearest-neighbor interac- 
tions for which the recognition of secondary structure by the polymerase in the resultant double-stranded 
polymer determines the entropy of the replicated strand. Two incorporation mechanisms arise naturally 
and their biological meanings are explained. It is known that replication occurs far from equilibrium 
and therefore the Shannon entropy here derived represents an upper bound for replication to take place. 
Likewise, this entropy sets a universal lower bound for the copying fidelity in replication.
Introduction

Many of the proteins in the cell are molecular motors which move along a molecular track and perform
mechanical work. Most of them work alone: a single protein carries out a given task without needing
to cooperate with other proteins to perform or optimize it. The single-molecule experimental
approach to the study of these motors sheds light on their complex stochastic dynamics and its connection
to their biological function [1].

Kinesin is probably the best characterized molecular motor at the single-molecule level [2]. It is
known that one of the roles of this protein is to transport cargoes along the microtubules with high 
processivity, that is, to transport a cargo for long distances without detaching from the microtubular 
track. Polymerases on the other hand have a more complex task. They not only have to translocate 
along a DNA template but most importantly, they have to copy a DNA single strand so that the fidelity 
in the so-called polymerization reaction is crucial for the cell division. To do so, DNA/RNA polymerase 
actually works as both a Turing Machine and a Maxwell's Demon [3HS] : it is capable of successively 
reading one nucleotide at a time, identifying a complementary nucleotide in the environment and writing 
the information by catalyzing a phosphodiester bond in the nascent replicated strand. Moreover, it is also 
capable of identifying errors in the copied strand by recognizing the secondary structure of the resulting 
double-stranded polymer [THS]. Some of these proteins can correct a wrong nucleotide by removing it 
and resuming the process in that position by the so-called proofreading mechanism [10], and some others
include strand displacement activity. DNA polymerase acts as a channel from the information point
of view since it passes the genetic information from a template strand to a copied one. The pairing process 
follows spontaneously by hydrogen bonding and the emerging helical structure of the double-stranded 
polymer is mainly the result of the stacking interactions between the new base-pair and its immediate 
previous neighbor in the polymer chain [12] . 

Kinesin uses the energy from ATP hydrolysis to move along the microtubules in individual steps of
8 nm by developing forces of 5-8 pN, with an efficiency of ~60% [5]. In a more complex fashion, DNA polymerase
uses part of the energy from the hydrolysis of deoxyribonucleotide triphosphates (dATP, dCTP, dGTP and dTTP)
for its own motion. Another part of the hydrolysis energy is used to branch each incorporated nucleotide, that
is, it is spent in the phosphodiester bond formation which leads to the nucleobase incorporation in the
nascent copied strand. The remaining energy from the hydrolysis of the triphosphate nucleotides (dNTPs), plus
that from the secondary structure formation, is still very high, which makes the turnover
efficiency of this enzyme paradoxically low (~23%) [13]. Besides, it is intriguing that the step of DNA polymerase
(0.34 nm) is much shorter than that of kinesin but the forces developed in each step are much higher
(~10-30 pN) (I3IIEI).

Although fidelity is the main role of this enzyme, the energy spent in accurately copying a single strand 
has only been included in the discussion of the energy balance in the case of independent nucleotide 
incorporations [T5JII7]- Here we calculate the entropy that is needed to order free nucleotides in a 
reservoir by following a sequence from a DNA template when no dissipation is present. Our theoretical 
framework allows the natural inclusion of interactions from near neighbors in the replication process. 
These interactions are closely related to the secondary structure formation of double-stranded DNA and, 
subsequently, to error recognition by DNA polymerases [7]. In the light of this theoretical framework,
we discuss the implications for the energy consumption of DNA polymerases.

DNA replication is a non-equilibrium process in which dynamical order is naturally generated [18].
Therefore, our calculation marks a lower bound for the energy that must be spent in the ordering process;
below this bound, polymerization is limited. As previously formulated [18,19], a natural consequence of the present
analysis is that DNA polymerase spends energy in channeling information from a template strand to a
copied one with a fidelity which is increased in the presence of dissipation. Our analysis allows envisioning
how far from equilibrium this process occurs.
Theoretical framework

We start by developing a theoretical framework to analyze information transfer in biomolecular systems. 
As in former literature [18,20,21], we use a mesoscopic approach to study genetic copying at the level of
a single DNA polymer. The process of ordering nucleotides according to a prescribed template sequence
is shown schematically in Fig. [1]A. From a thermodynamic point of view, the initial and final states are
mesoscopic equilibrium states although, as we study later in this article, the final state is different if the
copying process occurs in or out of equilibrium [22]. We will state that a process occurs in equilibrium when
it takes place through an infinite number of small transitions between equilibrium states. Then, we will
say that the polymerase works in equilibrium when the nucleotide selection and incorporation procedures
performed by this enzyme take place in the absence of friction or other forces which irreversibly release
heat [23-25]. We suppose that the concentration of dNTPs is larger than that of pyrophosphate (PPi)
so that the process we study is phosphate-hydrolysis driven at all times, but the motion of the polymerase
is very slow so as to preserve equilibrium conditions.

A system which transitions between two equilibrium states can increase its order if the system drives
external energy through appropriate dynamical paths [26]. In particular, Andrieux and Gaspard [18]
showed that non-equilibrium temporal ordering in copolymerization generates information at the cost of
dissipation. Based on experimental evidence from both structural and single-molecule studies, here we
mathematically model how the DNA polymerase 'demon' channels energy from dNTP hydrolysis to order
nucleotides according to a template pattern by using a minimum equilibrium description. Our scheme
is an idealization of real non-equilibrium copying mechanisms but will lead to a universal
(polymerase-independent) entropic upper bound for polymerization to take place.

A sequence in the template strand can be identified as a vector of parameters $y = (y_1, \ldots, y_i, \ldots, y_n)$,
where $i$ is an index which runs over the nucleotide position and $n$ is the number of nucleotides in the
template DNA strand. The copied strand is represented here as a sequence of nucleotides given by the vector
$x = (x_1, \ldots, x_i, \ldots, x_n)$, which stems from a multivariate random variable $X$. In replication, variables $X_i$
and parameters $Y_i$ take values over the same alphabet, namely $\mathcal{X}_{DNA} = \mathcal{Y}_{DNA} = \{A, C, G, T\}$. In
transcription, they take values over isomorphic alphabets, $\mathcal{X}_{DNA} = \{A, C, G, T\}$ and $\mathcal{Y}_{RNA} = \{A, C, G, U\}$.
A, C, G, T and U stand for the Adenine, Cytosine, Guanine, Thymine and Uracil nucleotide classes,
respectively. Hence, we can express without loss of generality for both replication and transcription:
$x_i, y_i \in \mathcal{X} = \{A, C, G, T\}$.

Since we are only dealing with the ordering process, we do not need to take into account the number
of phosphates in the nucleotide or the oxidative state. In the initial state, the nucleotides are independent
nucleobase entities with a triphosphate tail and in the final state, they are monophosphate molecular
subunits assembled in a linear chain by phosphodiester bonds. However, the nucleobase information remains
the same in both cases. In the case of replication, A, C, G and T are deoxynucleotides and in transcription
A, C, G and U are ribonucleotides. Therefore, as mentioned, we neglect the chemical condition without
loss of generality in the informational entropy analysis. Exact replication and transcription involve a
bijection between variables $X_i$ and $Y_i$ by the so-called Watson-Crick (WC) base-pairing rules, but as we
will see, non-WC unions can take place and give rise to copying errors. In translation, the analysis is a bit
more difficult since complementarity is replaced by the so-called genetic code, which involves a surjective
correspondence between variables $X_i$ (individual amino acids) and parameters $Y_i$ (triads of nucleotides).
This case will not be treated here.

The probability of having $n$ nucleotides in a certain sequence can be expressed as
$\Pr\{X_1 = x_1, \ldots, X_n = x_n\} = p(x_1, \ldots, x_n)$. The corresponding entropy is, according to the Gibbs formula,
$S(X_1, \ldots, X_n) = -k \sum_{x_1, \ldots, x_n} p(x_1, \ldots, x_n) \ln p(x_1, \ldots, x_n)$, where $k$ is the Boltzmann constant,
"ln" is the natural logarithm, and the random variables, $X_i$, take values, $x_i$, over the genetic alphabet $\mathcal{X}$. The calculation of these
probabilities and their associated entropy involves the very complex analysis of the architecture of genomes
and it is similar to that of generating text in a human language. Here, we calculate the entropy of copying
the information from a given DNA strand and therefore we are only dealing with the conditional entropy
of the sequence X for a given sequence Y. Then, we simply need to use the probabilities of base-pairing
nucleotides according to the template sequence, namely $\Pr\{X_1 = x_1, \ldots, X_n = x_n \| Y\} = p(x_1, \ldots, x_n \| y)$,
where we have used a double bar to address the conditional character introduced by the complementarity
in the base-pair formation. These probabilities can be expressed as a product of conditional probabilities
in which each new, incorporated nucleotide, $x$, depends not only on the template nucleotide, $y$, in front but
also on the previous base-pairs in the sequence, $p(x_1, \ldots, x_n \| y) = \prod_{i=1}^{n} p(x_i | x_{i-1}, \ldots, x_1 \| y^{(i)})$, where
$y^{(i)}$ is the parameter vector of the first $i$ nucleotides in the template strand. The entropy reads [27]:

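$$S(X_1, \ldots, X_n \| Y) = -k \sum_{x_1, \ldots, x_n} p(x_1, \ldots, x_n \| y) \ln p(x_1, \ldots, x_n \| y)$$

$$= -k \sum_{i=1}^{n} \sum_{x_1, \ldots, x_i} p(x_1, \ldots, x_i \| y^{(i)}) \ln p(x_i | x_{i-1}, \ldots, x_1 \| y^{(i)})$$

$$= \sum_{i=1}^{n} S(X_i | X_{i-1}, \ldots, X_1 \| Y^{(i)}). \qquad (1)$$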

The last part of this equation implies that the total entropy can be expressed as a sum over (double) 
conditional entropies. 
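
The decomposition in Eq. [1] can be checked numerically. The following Python snippet is only an illustrative sketch, not part of the original derivation: it draws an arbitrary joint distribution over copied sequences of length 3 for a fixed template (so that the conditioning on y is implicit) and compares the joint entropy with the sum of conditional entropies. The distribution, the seed and all function names are hypothetical choices; entropies are expressed in units of k.

```python
import itertools
import math
import random

ALPHABET = "ACGT"

def random_joint(n, seed=0):
    """Arbitrary (hypothetical) joint distribution p(x_1..x_n) for a fixed template."""
    rng = random.Random(seed)
    seqs = ["".join(s) for s in itertools.product(ALPHABET, repeat=n)]
    weights = [rng.random() + 1e-3 for _ in seqs]
    norm = sum(weights)
    return {s: w / norm for s, w in zip(seqs, weights)}

def prefix_marginal(p_joint, i):
    """Marginal probability of the first i nucleotides of the copied strand."""
    marg = {}
    for seq, p in p_joint.items():
        marg[seq[:i]] = marg.get(seq[:i], 0.0) + p
    return marg

def joint_entropy(p_joint):
    """-sum p ln p over whole sequences (units of k)."""
    return -sum(p * math.log(p) for p in p_joint.values())

def conditional_entropies(p_joint, n):
    """S(X_i | X_{i-1}, ..., X_1) for i = 1..n (units of k)."""
    terms = []
    for i in range(1, n + 1):
        p_i = prefix_marginal(p_joint, i)
        p_prev = prefix_marginal(p_joint, i - 1)  # i = 1 gives {"": 1.0}
        terms.append(-sum(p * math.log(p / p_prev[pre[:-1]])
                          for pre, p in p_i.items()))
    return terms

p = random_joint(3)
total = joint_entropy(p)
parts = conditional_entropies(p, 3)
print(total, sum(parts))  # the two numbers agree up to rounding
```

The agreement holds for any joint distribution, which is why the truncation introduced in the next subsection only affects the individual conditional terms and not the additive structure.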

Polymerase supervision 

Polymerization is a spontaneous process at both room and physiological temperatures (37 °C) since the
free energies of nucleotide incorporation for WC base-pairing are negative. However, the process in the 
absence of a catalyst may never occur. The biological catalyst or enzyme, the so-called polymerase, is 
not only able to accelerate the chemical reaction; it also has the capacity for recognizing the secondary 
structure in the nascent double-helix polymer by a complex mechanism in which the polymerase structure 
is involved [7H5]. Its size covers approximately one helical turn of the double-stranded polymer and 
this determines a natural length or number of chained nucleotides, $l$, over which correct copying is
supervised, as represented in Fig. [1]B and C. One helical turn involves a number of nucleotides $l \sim 10$.
This fact imposes a natural truncation over the conditional probabilities of nearest base-pair neighbors.
In other words, the polymerase error recognition mechanism can be envisioned as a process in which this
molecular machine supervises the copied strand by establishing correlations along the $l$ previous base-pairs
at each position, $i$, which mathematically involve conditional probabilities. Then, the probability can be
approximated by:

$$p(x_1, \ldots, x_n \| y) \simeq p(x_1 \| y_1)\, p(x_2 | x_1 \| y_2 | y_1) \cdots p(x_l | x_{l-1}, \ldots, x_1 \| y^{(l)}) \times \prod_{i=l+1}^{n} p(x_i | x_{i-1}, \ldots, x_{i-l} \| y^{(i)}_{l}), \qquad (2)$$

where $y^{(i)}_{l}$ is the parameter vector of $l$ nucleotides which ranges between template positions $i - l$ and $i$.
Eq. [1] now establishes:

$$S(X_1, \ldots, X_n \| Y) \simeq S(X_1 \| Y_1) + S(X_2 | X_1 \| Y_2 | Y_1) + \cdots + S(X_l | X_{l-1}, \ldots, X_1 \| Y^{(l)}) + \sum_{i=l+1}^{n} S(X_i | X_{i-1}, \ldots, X_{i-l} \| Y^{(i)}_{l}). \qquad (3)$$

Non-equilibrium paths which increase the fidelity of the copolymerization process are ultimately de- 
termined by the above polymerase-DNA structural fitting assumptions. Dynamical time evolutions are 
therefore concomitant to the basic mechanisms which appear in equilibrium. Then, the equilibrium 
description will provide an upper entropy bound for DNA replication. 

Entropy and Mutual Information 

The total entropy of the final state is $S(X, Y) = S(Y) + S(X \| Y)$. Parameters Y have been fixed throughout
evolution and therefore, we assume within the polymerization problem that there is no uncertainty
in determining these parameters. Hence, we set $S(Y) = 0$ without loss of generality. In these conditions,
the final entropy is given by the conditional entropy $S(X \| Y)$, as expressed by Eqs. [1] and [3].

The mutual information is $I(X; Y) = S(X) - S(X \| Y)$ [27], where $S(X)$ is the entropy of the initial
state. $S(X)$ is the entropy of the reservoir and it is fixed for given nucleotide concentrations, as will be
addressed later in this article. Then, the lower the entropy of the final state, the higher the information
acquired in the copy, the higher the fidelity and the lower the number of copying errors.

Finally, the entropy change in the polymerization process, $\Delta S(X, Y) = S(X \| Y) - S(X)$, and the
mutual information are equal but opposite in sign in these conditions (note that information is not defined
in bits), $I(X; Y) = -\Delta S(X, Y)$.
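
As a rough numerical illustration of these relations, the sketch below evaluates the mutual information per nucleotide under an assumption that is not taken from the text: an equimolar reservoir, for which the initial entropy per nucleotide is k ln 4. The final entropy per nucleotide is set to the no-neighbor estimate of 0.643 k/nt reported later in the Results. Variable and function names are hypothetical.

```python
import math

def mutual_information_per_nt(s_initial, s_final):
    """I = S(X) - S(X||Y) per nucleotide, in units of k (nats)."""
    return s_initial - s_final

# Assumption (not from the paper): equimolar reservoir, so S(X)/n = ln 4 in units of k.
s_reservoir = math.log(4)
# Entropy per copied nucleotide with no neighbor influence, as quoted in the Results.
s_copy = 0.643

info_nats = mutual_information_per_nt(s_reservoir, s_copy)
info_bits = info_nats / math.log(2)  # optional conversion from nats to bits
delta_s = s_copy - s_reservoir       # entropy change, equal to -I
print(info_nats, info_bits, delta_s)
```

With other reservoir compositions only `s_reservoir` changes; the sign relation between the entropy change and the mutual information is unaffected.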

Results 

The entropy in Eq. [3] depends not only on the number $l$ of nucleotides that are imposed by the fitting
length of the polymerase to the DNA template but also on the supervising mechanism (e.g. see [9]) that
the polymerase establishes by its architecture (molecular structure, e.g. see [7]). Due to the different
polymerases that exist in nature and their diverse structures and polymerization and proofreading
mechanisms, not to mention the cooperative associations of co-factor proteins in eukaryotic replication,
the calculation of the entropy cost of copying a nucleotide strand requires a heuristic model to establish
correlations over the $l$ nucleotides that the polymerase supervises. Then, the energy spent in the ordering
process involved in polymerization is polymerase-dependent. However, a general upper bound for the
entropy can be calculated for all the polymerases based on the fact that the secondary structure of the
double helix of nucleic acids depends mainly on the immediate neighbors [12]. As a first approximation,
we calculate the loosest upper bound by supposing that the previous base-pairs have no influence ($l = 0$).
The picture of the polymerase within this approximation is that of a nanomachine which reads one
nucleotide at a time and writes a complementary nucleotide to the replicated strand. Later, we introduce
the sequence-dependent effects in either (1) a reversible or (2) an irreversible copying process.

Entropy of nucleotides ordered with no neighbor influence 

The picture of the polymerase within this approximation is shown in Fig. [1]. The polymerase only uses
the information of one genetic symbol to decide the correct nucleotide to write in the replicated strand,
and thus it is represented as if it only covered one position in the template strand. In this case, the
random variables $X_i$ are independent although not identically distributed. Therefore, the probabilities
and entropies are:

$$p(x_1, \ldots, x_n \| y) = \prod_{i=1}^{n} p(x_i \| y_i), \qquad (4)$$

$$S(X_1, \ldots, X_n \| Y) = \sum_{i=1}^{n} S(X_i \| Y_i) = -k \sum_{i=1}^{n} \sum_{x_i \in \mathcal{X}} p(x_i \| y_i) \ln p(x_i \| y_i). \qquad (5)$$

It is important to note that in general the probability, $p(x_i \| y_i)$, depends implicitly on $i$ through the
random variable $X_i = X(i)$ and explicitly on the values of the parameters $y_i$. The former dependence
implies that the copying process may be subjected to local property changes, such as nucleotide
concentration or temperature and ionic gradients, so that the polymerase position $i$ influences the probabilities.
The latter dependence is explicit in the values of the parameters $y_i$ and addresses exclusively the sequence
dependence along the template. We assume that polymerization is position-invariant (cf. time-invariant
random walk) and then we calculate the entropy from Eq. [5] by using four independent probability
distributions, namely $p(x \| y)$; $x, y \in \mathcal{X} \Rightarrow \{p(x \| A), p(x \| C), p(x \| G), p(x \| T)\}$. Within this assumption, Eq. [5]
becomes

$$S(X_1, \ldots, X_n \| Y) = -k \sum_{x, y \in \mathcal{X}} n_y\, p(x \| y) \ln p(x \| y), \qquad (6)$$

where $n_y$ is the number of nucleotides of each type in the template sequence, thus fulfilling $\sum_{y \in \mathcal{X}} n_y = n$.
The Shannon entropy of a copied strand is no longer dependent on the template sequence but on the
number of nucleotides of each type on the template and their individual hybridization probabilities. The
transmitted genetic information would be very poor if replication were taking place in the absence of
nearest-neighbor base-pair interactions since a number of sequences
$W(n) = \binom{n}{n_A\, n_C\, n_G\, n_T} = n! / (n_A!\, n_C!\, n_G!\, n_T!)$ would be
passing the same information. If a dependence on position $i$ were explicit due to strong local property
changes in the environment, the transmitted information would be still poorer since the way boundary
conditions affect each replication reaction introduces a further uncertainty. The entropy of nucleotides
ordered with no neighbor influence was initially treated by Wolkenshtein and Eliasevich [16] and amended
by Davis [17] to introduce wrong incorporations. However, these authors did not calculate the error
probabilities which we introduce next.
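
To give a sense of how quickly the number of template sequences sharing the same composition grows, the multinomial count $W(n)$ can be evaluated directly. The snippet below is a minimal sketch with a hypothetical function name; it only assumes the standard multinomial formula.

```python
from math import factorial, log

def sequences_in_class(n_a, n_c, n_g, n_t):
    """Number of distinct templates with the given nucleotide counts (multinomial coefficient)."""
    n = n_a + n_c + n_g + n_t
    return factorial(n) // (factorial(n_a) * factorial(n_c) * factorial(n_g) * factorial(n_t))

# A template of 8 nucleotides with two of each type.
w_small = sequences_in_class(2, 2, 2, 2)
# A template of 40 nucleotides with ten of each type; ln W per nucleotide approaches ln 4 for long strands.
w_large = sequences_in_class(10, 10, 10, 10)
print(w_small, log(w_large) / 40)
```

All the templates counted by `sequences_in_class` yield the same entropy in Eq. [6], which is the degeneracy referred to above.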

The probability of incorporating a nucleotide $x$ in front of a nucleotide $y$ can be estimated from
experimental data as

$$p(x \| y) = \frac{1}{z(y)} \exp\left(-\frac{\Delta G^{x}_{y}}{kT}\right), \qquad (7)$$

$$z(y) = \sum_{x \in \mathcal{X}} \exp\left(-\frac{\Delta G^{x}_{y}}{kT}\right), \qquad (8)$$

where $\Delta G^{x}_{y}$ is the energy released (negative) or absorbed (positive) upon pairing a nucleotide $x$ to another
$y$ on the template strand and eventual stacking of the newly formed base-pair. Energies $\Delta G^{x}_{y}$ are obtained
from experimental data [12] at 37 °C (polymerization occurs in vivo in these conditions), as discussed in
Appendix A. Then, the probabilities are:



$$P = (p(x \| y)) = \begin{pmatrix} 0.054 & 0.012 & 0.035 & 0.765 \\ 0.033 & 0.011 & 0.789 & 0.034 \\ 0.144 & 0.961 & 0.118 & 0.128 \\ 0.769 & 0.016 & 0.058 & 0.072 \end{pmatrix} \qquad (9)$$

with $\sum_{x \in \mathcal{X}} p(x \| y) = 1$. Note that the matrix elements follow the order established by the alphabet
sequence $x, y \in \mathcal{X} = \{A, C, G, T\}$. The WC base-pairs appear on the anti-diagonal and, as expected,
their probabilities are the largest. Although the real fidelity of a polymerase is much higher than that
represented by these probability values, many of its features are addressed by this matrix. Namely, as
described in [28] for the human mitochondrial DNA polymerase, misincorporations which involve a G
are clearly favored and, with the exception of G·A, the error G·T is the most common. With the same
exception, incorporations onto T and G are favored over C and A. As also measured by [28], the error
C·C is the least probable. Moreover, a discrimination is found between the misincorporations G·T and
T·G. A significant conclusion from this calculation is that errors in polymerization are mainly determined
by the thermodynamic affinity of nucleobases.
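
The entropy of Eq. [6] can be evaluated directly from the matrix in Eq. [9]. The Python sketch below is illustrative only: it assumes that rows index the incorporated nucleotide x and columns the template nucleotide y, both in the order A, C, G, T, which is the reading under which the columns sum to one and the WC pairs fall on the anti-diagonal. Function names are hypothetical and entropies are in units of k. The per-nucleotide value obtained for a template with equal composition need not coincide with the representative value derived in Appendix B, which relies on a stationary random walk formalism.

```python
import math

ALPHABET = "ACGT"

# Matrix of Eq. [9]; assumed layout: rows = incorporated x, columns = template y.
P = [
    [0.054, 0.012, 0.035, 0.765],
    [0.033, 0.011, 0.789, 0.034],
    [0.144, 0.961, 0.118, 0.128],
    [0.769, 0.016, 0.058, 0.072],
]

def column(y):
    """Distribution p(x||y) over incorporated nucleotides for template nucleotide y."""
    j = ALPHABET.index(y)
    return [P[r][j] for r in range(len(ALPHABET))]

def site_entropy(y):
    """S(X||y) in units of k for a single template position."""
    return -sum(p * math.log(p) for p in column(y))

def strand_entropy(template):
    """Eq. [6]: sum over template nucleotide types of n_y * S(X||y), in units of k."""
    return sum(template.count(y) * site_entropy(y) for y in ALPHABET)

for y in ALPHABET:
    print(y, round(sum(column(y)), 3), site_entropy(y))
print(strand_entropy("ACGT") / 4)  # entropy per nucleotide for equal composition
```

Because only the counts $n_y$ enter, `strand_entropy("AACC")` and `strand_entropy("CACA")` return the same value, which illustrates the loss of sequence specificity in the absence of neighbor interactions.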

A similar discussion based on the raw free energy measurements instead of their associated
probabilities was formerly established in [29]. The role of kinetic and steric influence in replication fidelity
and the importance of mismatch repair in error propagation were also discussed therein in the context of
these thermodynamic data. The effect of water exclusion in the active site of DNA polymerases has also
been studied. In particular, base-pair interactions were shown to be stronger than would otherwise be
expected [30,31], which should enhance the contrast among the probability values in matrix Eq. [9]. Water
exclusion in the DNA double helix has also been shown to decrease the axial base-stacking interactions,
as reflected in the DNA stretch modulus [32]. A type-dependent polymerase mechanism is therefore expected
to optimize fidelity towards reported values of 1 error out of $\sim 10^5$ incorporated bases [25].
The presence of exonuclease activity would enhance fidelity towards reported values of 1 error out of
$\sim 10^6$-$10^7$ incorporated bases [28]. Both reactions have been reported to be out of equilibrium.
Although not reaching these values, the entropy of the polymerized strand in equilibrium is lower than that
represented by Eq. [9] (i.e. the transmitted information is higher) due to the presence of nearest-neighbor
interactions between base-pairs. We analyze this influence in the next section.

The entropy when no neighbor interactions are present for a 'class' of DNA templates (i.e. with fixed
$n_A$, $n_C$, $n_G$ and $n_T$) is calculated by introducing the matrix values of Eq. [9] in Eq. [6] and by using the
symbol parameters $y_i$, $i = 1, \ldots, n$. A representative, template-independent value of the absolute entropy
of polymerization within this approximation can be calculated in the limit of uniform incorporation of
nucleotides. This calculation is performed in Appendix B by using the formalism of stationary random
walk [37] and the result is $s = 0.643\,k$ per nucleotide (k/nt).

Entropy of nucleotides ordered within nearest-neighbor influence 

In this section we consider that the incorporation of a new nucleotide depends not only on the nucleotide
at position $i$ in the template but also on the recently formed base-pair at position $i - 1$. The physical nature
of this dependence is the base-stacking interactions between base-pairs, which make it more probable to place
a new nucleotide by a WC union than by any other combination since the eventual secondary structure of the
resulting double-stranded polymer is more stable. The nearest-neighbor interactions implicitly make both
the probability and the entropy of the replicated strand become sequence-specific and increase the fidelity
of the transmitted information. As pointed out before, the fact that a nearest-neighbor approximation
is sufficient to address secondary structure effects in the hybridization of two strands is supported by
former literature [12]. Therefore we introduce the hybridization energies from [12] as coefficients $\Delta G^{x,x'}_{y,y'}$,
which represent the free energy of positioning a nucleotide $x$ in front of a template nucleotide $y$ when the
previously formed base-pair is $x'$-$y'$, to calculate the Shannon entropy of a replicated DNA strand.

We assume that the polymerase is able to recognize the secondary structure of the double-stranded
polymer, as represented in Fig. [1]C, and thus decide the best match for each incorporated nucleotide. In
making this assumption, we use the fact that a polymerase is continuously grabbing nucleotides at random,
fluctuating between an open and a closed conformation. The polymerase-dNTP binding energy stabilizes a
closed conformation of the enzyme which is used to attempt the incorporation of each grabbed nucleotide
to the template at each position $i$. Wrong matches are released in their initial triphosphate state and
best matches are hydrolyzed with release of PPi and eventually branched to the previously incorporated
nucleotide in the growing complementary strand through a polymerase-catalyzed phosphodiester bond.
The fluctuating state of the polymerase is restored after the nucleotide incorporation by using part of the
energy from the dNTP hydrolysis. This structural reset enables the enzyme to translocate to the next
template position [14,33] and leads to its memory erasure [6,24,25].

Polymerization can be considered a reversible reaction if the polymerase is not included in the process.
The inverse reaction, the logical unreading [19] in which a branched nucleotide in its monophosphate state
(dNMP) is unbranched and released in a triphosphate state, is the so-called pyrophosphorolysis, which is
not biologically related to an editing process. The occurrence of this reaction depends on the concentration
of PPi in the reservoir, which we suppose to be low compared to that of dNTPs. In the exonucleolysis
reaction, which is performed by a different enzyme or by the 'exo' domain of some polymerases [10], the
initial state of the DNA template is recovered but not that of the cleaved nucleotide since it is released 
in a monophosphate state with the consequent dissipation of part of the energy from the phosphodiester 
bond breakage. Exonucleolysis is thus irreversible, as expected from an editing process ffifl!25j . However, 
having in mind that the reservoir is not affected by the substitution of a few dNTPs for dNMPs, the total 
effect of polymerization and exonucleolysis can be assumed to encompass a reversible copying process. 
This assumption will become clearer later. 

Based on the previous-neighbor-influence assumption, the conditional probabilities are truncated for
$l = 1$ in Eqs. [2] and [3]. Then, the probability distributions are given by coefficients $p(x_i | x_{i-1} \| y_i | y_{i-1})$,
such that $\sum_{x_i \in \mathcal{X}} p(x_i | x_{i-1} \| y_i | y_{i-1}) = 1$, which implies that there is at least one nucleotide $x_i$ that binds
to a nucleotide $y_i$ for a previously formed base-pair made up of a nucleotide $x_{i-1}$ in front of $y_{i-1}$. It is

also assumed that the energies for nearest neighbors fulfill $\Delta G^{x,x'}_{y,y'} = \Delta G^{y',y}_{x',x}$ (strand symmetry). This
assumption is true provided that hybridization energies do not show a higher-order dependence on the
nearest neighbors within experimental resolution [12]. It is approximated otherwise [34].

The total energy of a configuration, $\nu$, made up of a sequence $x$ hybridized on a template sequence
$y$ is $E_\nu = E(x_1, \ldots, x_n \| y) = \sum_{i=1}^{n} E(x_i | x_{i-1} \| y_i | y_{i-1})$ (see Appendix C), where $E(x_i | x_{i-1} \| y_i | y_{i-1})$ is
the energy of pairing nucleotide $x_i$ on $y_i$ provided that the previously incorporated nucleotide $x_{i-1}$ is
already hybridized on the template nucleotide $y_{i-1}$. If the energies are not affected by local changes,
such as nucleotide concentration or temperature and ionic gradients, their values will only depend on the
position, $i$, on the template through the values of $y_i$.

The Shannon entropy can be calculated by using a partition function formalism, according to what we
hereafter call the Ising mechanism, or by using a Markov chain formalism, according to what we hereafter
call the Turing mechanism.



Ising and Turing mechanisms 

The Ising mechanism corresponds to an Ising model and it is thus calculated by using a partition function. 
As we show below, the degree of reversibility of this mechanism depends on the stability of the base-pairs 
as represented by their free energies. The conditional probability at each step is given by (see Appendix 
C): 

$$p(x_i | x_{i-1} \| y_i | y_{i-1}) = \frac{e^{-\beta E(x_i | x_{i-1} \| y_i | y_{i-1})}\, f(x_i \| y_i, \ldots, y_n)}{\sum_{x_i} e^{-\beta E(x_i | x_{i-1} \| y_i | y_{i-1})}\, f(x_i \| y_i, \ldots, y_n)}, \qquad (10)$$

where

$$f(x_i \| y_i, \ldots, y_n) = \sum_{x_{i+1}, \ldots, x_n} e^{-\beta E(x_{i+1} | x_i \| y_{i+1} | y_i)} \cdots e^{-\beta E(x_n | x_{n-1} \| y_n | y_{n-1})}. \qquad (11)$$

The Turing mechanism, which as mentioned is based on the Markov chain formalism, is implicitly
irreversible. The probability of placing a nucleotide $x$ in front of a template nucleotide $y$ at position $i$
within this formalism is given by (see Appendix C):

$$p(x_i | x_{i-1} \| y_i | y_{i-1}) = \frac{e^{-\beta E(x_i | x_{i-1} \| y_i | y_{i-1})}}{\sum_{x_i} e^{-\beta E(x_i | x_{i-1} \| y_i | y_{i-1})}}. \qquad (12)$$

Eq. [10] represents a probability which depends on the index $i$ through the symbol value $y_i$ and through
the length $n$ of the template strand chain. Therefore, in the Ising mechanism, for which nucleotides are
assumed to freely branch and unbranch, the final macroscopic state is affected by the finite length of
the genome. In the Turing mechanism, Eq. [12], on the contrary, the probability only depends on the
position $i$ through the sequence. The latter is the process which takes place in polymerization in the
absence of exonucleolysis because it is associated with a unidirectional incorporation of nucleotides. The
former allows the already incorporated nucleotides to be replaced by new nucleotides and therefore it
naturally introduces the effect of exonucleolysis in equilibrium. In the limit of very negative free energies
for WC base-pairs (high stability) and very positive free energies for wrong base-pairs (low stability),
the probabilities in Eqs. [10] and [12] become Kronecker delta-like functions
($p(x_i | x_{i-1} \| y_i | y_{i-1}) = \delta^{WC}_{x_i, y_i} \times \delta^{WC}_{x_{i-1}, y_{i-1}}$,
where $\delta^{WC}_{x, y} = 1$ if $x$ is WC-complementary to $y$ and zero otherwise) and both calculations
converge to $S = 0$, that is, information transmitted in the absence of errors.
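
The difference between the two mechanisms can be made concrete on a toy model. The Python sketch below is not the calculation of the paper: it replaces the empirical nearest-neighbor energies by a hypothetical energy function (a pairing term and a stacking term that apply only to WC pairs, in units of kT so that the inverse temperature equals 1) and enumerates all copied sequences for a short template. The Ising branch uses the Boltzmann distribution over whole configurations, which is what the conditionals of Eq. [10] multiply out to, while the Turing branch normalizes each step locally as in Eq. [12]. All names and parameter values are assumptions made for illustration.

```python
import itertools
import math

ALPHABET = "ACGT"
WC = {"A": "T", "C": "G", "G": "C", "T": "A"}

def step_energy(x, x_prev, y, y_prev, pair=2.0, stack=1.0):
    """Hypothetical energy (units of kT) of placing x on y after the pair x_prev-y_prev."""
    if WC[y] != x:
        return 0.0
    energy = -pair
    if x_prev is not None and WC[y_prev] == x_prev:
        energy -= stack  # stacking bonus only between two consecutive WC pairs
    return energy

def total_energy(xs, ys, **kw):
    """Sum of step energies along the strand (first position has no previous pair)."""
    return sum(step_energy(xs[i], xs[i - 1] if i else None,
                           ys[i], ys[i - 1] if i else None, **kw)
               for i in range(len(ys)))

def entropy(probs):
    """-sum p ln p in units of k."""
    return -sum(p * math.log(p) for p in probs if p > 0)

def ising_entropy(ys, **kw):
    """Entropy of the Boltzmann distribution over whole copied sequences."""
    weights = [math.exp(-total_energy(xs, ys, **kw))
               for xs in itertools.product(ALPHABET, repeat=len(ys))]
    z = sum(weights)
    return entropy(w / z for w in weights)

def turing_entropy(ys, **kw):
    """Entropy of the Markov chain with step-by-step (local) normalization."""
    probs = []
    for xs in itertools.product(ALPHABET, repeat=len(ys)):
        p = 1.0
        for i in range(len(ys)):
            x_prev = xs[i - 1] if i else None
            y_prev = ys[i - 1] if i else None
            w = {x: math.exp(-step_energy(x, x_prev, ys[i], y_prev, **kw))
                 for x in ALPHABET}
            p *= w[xs[i]] / sum(w.values())
        probs.append(p)
    return entropy(probs)

template = "ACGT"
print(ising_entropy(template), turing_entropy(template))
print(ising_entropy(template, pair=20.0), turing_entropy(template, pair=20.0))
```

In this toy model the Ising entropy is the lower of the two, and both entropies become negligible when the pairing term is made very favorable, in line with the Kronecker delta limit described above. Brute-force enumeration scales as $4^n$, so the sketch is only meant for short templates.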

The entropy of replicating monotonous sequences, polydA, polydC, polydG, and polydT, and periodic
and random sequences is represented in Fig. [2] for both the Ising and Turing mechanisms. As shown,
the absolute entropy per incorporated nucleotide decreases and converges to the thermodynamic limit
very rapidly, within ~13 nucleotides, as further confirmed by Monte Carlo simulations of the internal
energy (see Appendix D). For periodic sequences, the convergence reflects an attenuated periodicity which
correlates with the template sequence. This trend is also observed for the internal energy, Fig. [3], and the
Helmholtz free energy, Fig. [4]. The entropy when no neighbor interaction is assumed is also plotted for
comparison. This approximation for independent, non-identically distributed random variables is not
dependent on the kind of calculation and the resulting entropy is always higher than that obtained in the
presence of nearest-neighbor interactions. This result is expected since when correlations are established
among nearest neighbors which lead to conditional probabilities, the probability of error decreases and
the absolute entropy is closer to zero [27].

Figure 2 also reflects another feature which takes place when neighbor interactions are taken into
account: the absolute entropy for the Ising mechanism is lower than for the Turing one. Although in both
mechanisms all configurations are accessible, the number of pathways through which each configuration
is accessible in the Turing mechanism is lower. As stated above, the Ising mechanism is reversible and
thus it naturally includes the effect of exonucleolysis, which consistently leads to the lowest entropy and
consequently to the lowest number of errors in the replicated DNA, in agreement with previous
proofreading analysis [19, 35]. Finally, Fig. 2 reveals a large entropic discrimination between polymerizing a
polydG and a polydC which is not found between a polydA and a polydT. This feature is consistent with
what was shown for the case in which no neighbor interactions were taken into account (see matrix Eq. (9)
in the previous section). This effect is purely entropic since the internal and the Helmholtz
free energies (Figs. 3 and 4, respectively) do not exhibit such discrimination.

Error rates 

Errors are defined as non-WC unions. A gross estimate of the probability of error, $p_{error}$, can be
obtained through the Shannon-McMillan-Breiman theorem [27], which states that
$-(1/n)\, k \ln p(x_1, \ldots, x_n\|\mathbf{y}) \to s(X)$ with probability 1, where $s(X)$ is the entropy per
incorporated nucleotide (cf. entropy rate in a random walk). Then, by defining the probability of error from
the geometric mean, $1 - p_{error} = p(x_1, \ldots, x_n\|\mathbf{y})^{1/n}$, it follows that
$p_{error} \to 1 - \exp(-s(X)/k)$. This definition implicitly assumes that
each incorporated nucleotide is independent of the previous base-pairs and therefore the $p_{error}$ thus obtained
represents an upper bound. The entropy per incorporated nucleotide, as extracted from Fig. 5A for general
random sequences with equal number of nucleotides of each class, is $s \sim 0.1$ k/nt for the reversible
process, which leads to $p_{error} \sim 10^{-1}$.
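
The estimate above can be reproduced with a few lines. This is only an arithmetic check of the relation $p_{error} \to 1 - \exp(-s(X)/k)$ quoted in the text; the function name `error_probability` is introduced here for illustration.

```python
import math

def error_probability(s_over_k):
    """Upper-bound estimate p_error = 1 - exp(-s/k), with s the entropy per nucleotide."""
    return 1.0 - math.exp(-s_over_k)

# s ~ 0.1 k/nt is the value quoted in the text for the reversible process.
p_reversible = error_probability(0.1)
print(f"p_error ~ {p_reversible:.3f}")
```

For small entropies the relation is nearly linear, $p_{error} \approx s/k$, which is why $s \sim 0.1$ k/nt maps onto an error probability of order $10^{-1}$.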

The probability of error can be more realistically calculated by simulating a large number of sequences
according to the joint probability dictated by either the Ising or the Turing mechanism. Monte Carlo
generation of sequences (see Appendix D) according to the templates studied in Fig. 2 gives rise to
$p_{error} \sim 10^{-2}$ for the Ising mechanism (the lowest average probability of error being for the polydC
template, $\sim 10^{-3}$) and $p_{error} \sim 10^{-1}$ for the Turing mechanism (average probability of error for
polydC, $\sim 10^{-2}$). As a cross-check, we note that in the absence of nearest-neighbor interactions, these
Monte Carlo simulations give rise to an average $p_{error} \sim 0.2$ (the lowest $p_{error}$, again, for the polydC
template, $p_{error} \sim 0.04$), consistent with the information provided by matrix Eq. (9).

The average probability of error therefore decreases, firstly, in the presence of nearest-neighbor inter- 
actions and, secondly, for the Ising mechanism since, as explained above, this mechanism contains the 
effects of exonucleolysis. Although the average probabilities of error in real, non-equilibrium replication
can be much lower than the ones calculated here in equilibrium, similar error rates have been reported
for some polymerases [36].

Internal and Free Energies 

Figure [3] shows the behavior of the mean energy (see Appendix C) which is released upon incorporating 
a new nucleotide at each step of the polymerase. The stored information in the double-stranded polymer 
gives rise to a higher internal energy in absolute value for the Ising mechanism than for the Turing one 
since the number of WC unions, which involve stronger interactions than other pairing possibilities, is 
higher for the former mechanism. 

An error decreases the stability of a microstate (i.e. decreases in absolute value the (negative) energy,
$E_\nu$, of an individual nucleotide arrangement, $\nu$, in the copied strand) with respect to the case of a correct
(WC) match, not only by contributing with an either less negative or positive energy at the position 
of incorporation, i, but also at the next step, i + 1, independently of whether the next incorporated 
nucleotide is a correct match or another error. This does not happen for the case in which no nearest- 
neighbor influence is taken into account since in that case an error only affects the stability of a microstate 
at the position where the wrong nucleotide is incorporated. Therefore, if the number of errors when the 
influence of previous neighbors is taken into account is not much smaller than for the case in which 
no influence is taken into account, the total energy of a microstate will be on average lower (i.e. will 
give rise to a less stable configuration) for the former. This is why the internal energy for the Turing 
mechanism is lower in absolute value compared to that in the absence of nearest-neighbor interactions. 
The internal energy for the Ising mechanism is however higher in absolute value than for the case in 
which no nearest-neighbor interactions are taken into account because in this mechanism the number of 
errors is much smaller than for the Turing one (see Fig. 3).

The Helmholtz free energy, Fig. 4, reflects that the information generated under the Ising mechanism
is more significant than that generated under the Turing one since the former produces a smaller number
of errors. The free energy also provides information about the spontaneity of copying a template DNA
strand. As shown, for a given initial free energy it is more favorable to write a DNA replicate under a
process in which each copied symbol does not have a memory of the copying history. When neighboring
interactions are considered, the process for which nucleotides are not written by following a directionality
(Ising mechanism) is favored over that in which the symbols must be copied on a directional one-after-one
basis (Turing mechanism) in the 3' to 5' template sense. It is important to note, however, that the real,
non-equilibrium process involves a more complex enzymatic coordination for the former procedure, which
actually involves a different physical pathway that could make such a procedure more unfavorable
(e.g. see ITIUDIESI ) - 

Initial entropy of the nucleotides from the reservoir

The entropy of the initial state is determined by the probability distribution that a certain nucleotide is
grabbed by the polymerase. Hence, this entropy addresses the order in which individual nucleotides reach
the 'pol' site of the enzyme, independently of whether they are eventually incorporated into the
replicated strand or discarded back to the reservoir. This entropy is not unique and depends mainly on
the concentrations of the nucleotides. For a reservoir saturated with each nucleotide class, the entropy
can be calculated by the Boltzmann formula, $S_0 = k \ln W(n)$, where $W$ is the number of microstates
compatible with a 'macrostate' of $n$ nucleotides. The nucleotide numbers, $n_x$, $x \in X$, fulfill
$\sum_{x} n_x = n$ and therefore $W(n)$ is given by the multinomial coefficient. As expected, for large $n$,
the Gibbs and Boltzmann formulas provide the same values, namely

$$
\frac{S_0}{n} = \frac{k}{n} \ln W(n) \simeq -k \sum_{x \in X} p_n(x) \ln p_n(x), \tag{13}
$$

where $p_n(x) = n_x/n$ is the probability of grabbing a nucleotide of type $x$ which is in the reservoir at
concentration $n_x/n$. A similar approach to this entropy can be extracted from the case of the ideal gas.
The informational contribution to the entropy for this system is the same, as expected, and has been
analyzed in Appendix E.

By setting equal concentrations for all the nucleotide types, the initial entropy is $s_0 = k \ln 4 = 1.39$
k/nt. This is the value obtained by setting $W(n) = 4^n$, which better describes the infinite number
of nucleotides of each class contained in an ideal reservoir. The entropy difference between the initial and
final states is therefore $\Delta s \sim -1$ k/nt. This entropy change cannot be much larger than this value since
the lowest (non-equilibrium) final entropy is in any case $S > 0$. However, approaching $S = 0$, or zero
error rate, involves an ever increasing energetic cost with subsequent dissipation fTSirrS], in accord with 
the third law of thermodynamics. 
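
The agreement between the Boltzmann and Gibbs formulas for large $n$, and the value $k \ln 4 \approx 1.39$ k/nt for equal concentrations, can be checked numerically. The sketch below assumes that $W(n)$ is the multinomial coefficient $n!/\prod_x n_x!$, as stated in the text; the function names are introduced here for illustration only.

```python
import math

def boltzmann_entropy_per_nt(counts):
    """(1/n) ln W in units of k, with W = n! / prod(n_x!) the multinomial coefficient."""
    n = sum(counts)
    ln_w = math.lgamma(n + 1) - sum(math.lgamma(c + 1) for c in counts)
    return ln_w / n

def gibbs_entropy_per_nt(counts):
    """-sum_x p(x) ln p(x) in units of k, with p(x) = n_x / n."""
    n = sum(counts)
    return -sum((c / n) * math.log(c / n) for c in counts if c > 0)

# Equal numbers of A, C, G and T; the two formulas approach ln 4 as n grows.
for per_class in (10, 1000, 100000):
    counts = [per_class] * 4
    print(per_class, boltzmann_entropy_per_nt(counts), gibbs_entropy_per_nt(counts))
```

For small reservoirs the Boltzmann value lies below the Gibbs value; the gap closes as the number of nucleotides of each class increases, which is the regime of an ideal reservoir.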

Discussion 

A common mechanical action of linear molecular motors such as kinesin and polymerases is translocation
along a molecular track. However, the main role of polymerases is the faithful copying of DNA, a
double-stranded polymer that constitutes a particular molecular track which stores information. The selection of one
correct nucleotide at each translocating step of the polymerase constitutes a mechanism which needs
energy as well. We have calculated the entropy balance of a system of nucleotides randomly distributed
in a reservoir which are finally incorporated into a copied strand according to a template DNA in the
absence of energy dissipation. To that end, we have evaluated the Shannon entropy at both the initial
and the final states of the nucleotide symbols by connecting these states through an equilibrium pathway.
We show that the entropy related to fidelity must be reduced from the initial state by ~1 k at each step of
the polymerase. Given that the initial internal energy of the free nucleotides is (3/2) kT/nt (equipartition
theorem), that their associated entropy is ~1.4 k/nt, and that the final free energy is that shown in Fig. 4,
the free energy invested in copying fidelity must be at least ~2 kT/nt.

A gross analysis of the bulk chemical equilibrium between correct/incorrect incorporation of
nucleotides shows that the energy required to maintain a copying fidelity of one wrong nucleotide out of
$10^m$, i.e. an error rate $p_{error} \sim 10^{-m}$, is $\Delta G = -kT \ln p_{error} = 2.3m\, kT$ [36], similar to but larger
than the above analysis for a low number of errors (i.e. for $m = 1$, 2 and 3) since in this estimation the
polymerase is not assumed to work very slowly. In particular, for real error rates of $\sim 10^{-3}$ to $10^{-7}$, the
free energy is much larger than 2 kT/nt. If this energy is added to the thermodynamic efficiency [37] of
polymerases, the resulting values would be much higher than those estimated from just the analysis of
the translocation mechanism [13]. The contrast between this analysis and the equilibrium polymerization
scheme presented in this article demonstrates, on the one hand, that the copying pathway in polymerization
(which may be coupled to the translocation one) is far from equilibrium, and on the other hand,
that the final state of the nucleotides in the copied strand depends on whether the copying mechanism
occurs in or out of equilibrium. The latter implies that information managing results in very different
fidelities depending on how far from equilibrium the copying mechanism takes place, in agreement with
former literature [18], and on what the specific polymerase mechanism is.
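
The bulk-equilibrium estimate $\Delta G = -kT \ln p_{error} = 2.3m\, kT$ can be tabulated directly. The sketch below is only this arithmetic; the function name `fidelity_free_energy_kT` is introduced here for illustration.

```python
import math

def fidelity_free_energy_kT(m):
    """Free energy (in kT) for an error rate 10**(-m): -ln(10**(-m)) = m ln 10."""
    return m * math.log(10.0)

# m = 1, 2, 3 are the low-fidelity cases discussed in the text; m = 7 is a high-fidelity case.
for m in (1, 2, 3, 7):
    print(f"p_error ~ 1e-{m}:  Delta G ~ {fidelity_free_energy_kT(m):.1f} kT")
```

Already for $m = 1$ the estimate (about 2.3 kT) is of the order of the ~2 kT/nt obtained from the equilibrium analysis, and it grows linearly with $m$.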

Non-equilibrium paths can certainly destroy the information acquired in equilibrium, but they can
also amplify it. We have explained that DNA-polymerase structural fitting is responsible for increasing
dynamical order in the replication process. The analysis presented here thus provides a universal upper
bound on the absolute entropy in polymerization and, consequently, an error tolerance for the copying fidelity.
Each individual polymerase actually uses a specific replication mechanism, in the presence or absence of
exonucleolysis, which sustains an associated error rate evolutionarily coupled to the development of its
cellular line. In our analysis, we include the effects of the previous neighbor base-pair, whose physical nature
is the base-stacking interactions. These interactions are responsible for the secondary structure of DNA
(the double helix), and its correct formation is supervised by the DNA polymerase through structural
fitting [7–9]. We show that such a supervising mechanism reduces the entropy of the copied strand with
respect to the case in which these interactions are neglected, a consequence of the fact that information
fidelity increases in the presence of conditional probabilities [27].

Finally, we show that the inclusion of the nearest-neighbor interactions leads to different absolute
entropies of the polymerized strand depending on whether nucleotides are incorporated in either an
irreversible or a reversible process. The latter presents the lowest absolute entropy, which is consistent
with the error reduction generated by proofreading [19, 35], a mechanism in which nucleotides are removed
by exonucleolysis in a backtracking motion of the polymerase or by the presence of an exonuclease
enzyme. Error rates within these two equilibrium mechanisms with nearest-neighbor influence are in the
$10^{-1}$ to $10^{-3}$ range, better than in the simplest scheme where these interactions are neglected, and near some real
polymerase fidelities [36]. Most commonly reported polymerases, however, strongly differ from these rates,
which ultimately reflects how far from equilibrium they work. The equilibrium mechanisms described
here are inherent to more general non-equilibrium polymerization pathways since time evolutions are
ultimately mediated by the polymerase demon action.

Although polymerases speed up the replication/transcription reactions, it is important to note that
translocation for these molecular motors is slower than for transport molecular motors such as kinesin
and myosin [13]. This fact suggests that non-equilibrium replication pathways are mainly focused on the
regulation of specific error rates in copying fidelity rather than on the translocation mechanism.

Conclusions 

We have conceived a probabilistic framework based on structural and single-molecule experimental results 
which models the copy of genetic information by molecular motors through the recognition of DNA 
secondary structure. The link between thermodynamic entropy, which is based on statistical concepts at 
the molecular level, and Shannon entropy, which is based on the processing of information, arises naturally 
within the model. Our mathematical framework provides a connection between entropy and fidelity in 
replication and leads to universal bounds. Error rates similar to the ones theoretically deduced here
in the stepwise equilibrium limit ($< 10^{-3}$) have been measured for some polymerases, which ultimately
reflects the consistency of this model with the experiments.

Polymerase 'pol' and 'exo' catalytic residues are conserved throughout evolution: viral, prokaryotic 
and eukaryotic polymerases exhibit common structural domains and replication mechanisms. The exis- 
tence of a certain degree of structural variability and the presence of replicative complexes, which involve 
auxiliary proteins and coordination strategies, should have an effect on fidelity. In particular, these factors 
may regulate fidelity to balance maintenance of genetic identity and the species' ability to evolve/adapt,
in most cases increasing fidelity by several orders of magnitude with respect to the values obtained in this
work. Our analysis attempts a universal description of polymerase fidelity since only basic assumptions
common to all polymerases have been made. This analysis is therefore a starting point for developing 
theoretical models describing specific polymerases. In this regard, highly processive polymerases which 
do not require a cooperative association of co-factor proteins like those from some bacteriophages may 
constitute the first targets for specific modeling. 

Developments of the present analysis for specific polymerases can therefore be used to test mechanistic
hypotheses on polymerase fidelity by contrasting the subsequently calculated error rates with experimental
results. Progress in this direction is not only interesting to biology but may also inspire nanotechnologies
in information processing. Unique to naturally engineered copying/editing nanomachines like the
polymerase enzyme are the inherently stochastic mechanisms by which they manage classical information,
in contrast to artificial devices for which fluctuations are undesired events. The theoretical modeling of
specific polymerases thus represents a physical basis to connect classical information processing in
biological systems to artificial, nanoscale platforms, and to open promising avenues in quantum information
copying strategies.

Appendix A. Hybridization Energies under no neighbor influence 

Free energies $\Delta G^{x}_{y}$ are obtained from hybridization energies which include the effect of both the
base-pairing and base-stacking interactions. The former involve the hydrogen bonding between complementary
nucleotides and the latter mainly contain the hydrophobic interaction between the newly formed base-pair
and the previous one. These energies have been extensively measured and are summarized in [12] for
both correct, Watson-Crick (WC) base-pairs and mismatches. In the approximation that we are dealing
with here, we consider that these energy levels are degenerate and assume that the hybridization energy
only depends on the unmatched nucleotide $y$ in the template strand. Then, we take averages over all
previous base-pair possibilities. For this purpose we define the following probability distributions for fixed
$x$-$y$ base-pairs:



where $\Delta G^{x_0,x}_{y_0,y}$ is the energy released upon pairing a nucleotide $x$ to another $y$ on the template strand and
eventual stacking of the newly formed base-pair with the previous base-pair made up of a nucleotide $x_0$ in
front of another $y_0$. Most of the base-pairs involving two consecutive non-WC associations are unstable
hybridizations and no data were given in [12]. They involve very unfavorable processes (i.e. very high
and positive free energies) and we have considered for these cases that $\Delta G = +\infty$. Here we use the data
at 37°C and 150 mM NaCl concentration since polymerization occurs in vivo in these conditions. Then,
the hybridization energies in the absence of nearest-neighbor interactions are given by




(14) 




(15) 




(16) 



Under these considerations, the free energies in kT units are:

$$
\Delta G = \left(\Delta G^{x}_{y}\right) =
\begin{pmatrix}
 1.01 &  1.70 &  0.34 & -1.65 \\
 1.50 &  1.78 & -2.78 &  1.46 \\
 0.03 & -2.69 & -0.88 &  0.14 \\
-1.64 &  1.39 & -0.16 &  0.71
\end{pmatrix} \tag{17}
$$

where matrix elements follow the order established by the alphabet sequence $x, y \in X = \{A, C, G, T\}$.
Although there is a dependence on both the temperature and the ionic strength, as explained in [12], the
latter dependence cancels out in the probability calculation (see Eqs. (7) and (8) in the main text).
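
As an illustration of how the free energies of Eq. (17) translate into incorporation probabilities, the sketch below normalizes the Boltzmann weights $\exp(-\Delta G^{x}_{y}/kT)$ over the incorporated nucleotide $x$ for each template nucleotide $y$. Two points are assumptions made here and not statements of the source: that rows index $x$ and columns index $y$, and that this simple normalization is the relevant probability rule. The names `DELTA_G` and `incorporation_probabilities` are hypothetical.

```python
import math

NUCLEOTIDES = ("A", "C", "G", "T")
WC_PARTNER = {"A": "T", "C": "G", "G": "C", "T": "A"}

# Free energies of Eq. (17) in kT units.
# Assumption: rows index the incorporated nucleotide x, columns the template nucleotide y.
DELTA_G = [
    [1.01, 1.70, 0.34, -1.65],
    [1.50, 1.78, -2.78, 1.46],
    [0.03, -2.69, -0.88, 0.14],
    [-1.64, 1.39, -0.16, 0.71],
]

def incorporation_probabilities(delta_g):
    """Return p[x][y] = exp(-dG[x][y]) / sum over x' of exp(-dG[x'][y])."""
    size = len(delta_g)
    probs = [[0.0] * size for _ in range(size)]
    for y in range(size):
        weights = [math.exp(-delta_g[x][y]) for x in range(size)]
        z = sum(weights)
        for x in range(size):
            probs[x][y] = weights[x] / z
    return probs

P = incorporation_probabilities(DELTA_G)
errors = {}
for y, template in enumerate(NUCLEOTIDES):
    x = NUCLEOTIDES.index(WC_PARTNER[template])
    errors[template] = 1.0 - P[x][y]
    print(f"template {template}: p(WC) = {P[x][y]:.3f}, p(error) = {errors[template]:.3f}")
```

Under these assumptions the error probability is lowest for a C template and averages to roughly 0.2 over the four templates, in line with the orders of magnitude quoted in the main text for the case without nearest-neighbor interactions.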



Appendix B. Entropy per nucleotide for the uniform process 

The matrix Eq. (9) in the main text can be interpreted as the probability transition matrix in a four-state
Markov process and thus it is possible to calculate the uniform (stationary) probability distribution. By using this
distribution in the limit of uniform incorporation of nucleotides when nearest-neighbor interactions are
not taken into account, it is possible to set an upper bound to the absolute entropy per incorporated
nucleotide that is generated in the polymerization of a DNA strand starting from a general template DNA.
This calculation is useful because this entropy bound does not depend on the DNA template sequence.
The uniform distribution fulfills the matrix equation $P\mu = \mu$ [27], where $\mu$ is the uniform probability
(column) vector. In other words:

$$
\sum_{y \in X} p(x\|y)\, \mu(y) = \mu(x), \tag{18}
$$

where $p(x\|y)$ and $\mu(x)$ are such that $\sum_{x \in X} p(x\|y) = 1$ and $\sum_{x \in X} \mu(x) = 1$. The entropy per
incorporated nucleotide (cf. entropy rate in a random walk) for the uniform process, $s(X)$, can be calculated [27] from
the equation

$$
s(X) = -k \sum_{x, y \in X} \mu(y)\, p(x\|y) \ln p(x\|y). \tag{19}
$$

By using the probability transition matrix Eq. (9) we obtain that the absolute entropy per copied
nucleotide in DNA polymerization is bounded by $s(X) = 0.643$ k/nt (37°C). This value has been
obtained in the limit of (a) no nearest-neighbor interaction with previously formed base-pairs and (b)
uniform incorporation of nucleotides. This entropic upper bound is independent of the template sequence.
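
The procedure of Eqs. (18) and (19) can be sketched as follows. Since matrix Eq. (9) is not reproduced in this appendix, the sketch uses a stand-in transition matrix built by normalizing the Boltzmann weights of Eq. (17) over $x$ for each $y$; this stand-in, the row/column convention and the function names are assumptions, so the printed value need not coincide with the 0.643 k/nt quoted above.

```python
import math

# Free energies of Eq. (17) in kT units (assumed rows: x, columns: y; order A, C, G, T).
DELTA_G = [
    [1.01, 1.70, 0.34, -1.65],
    [1.50, 1.78, -2.78, 1.46],
    [0.03, -2.69, -0.88, 0.14],
    [-1.64, 1.39, -0.16, 0.71],
]

def transition_matrix(delta_g):
    """Stand-in for Eq. (9): p[x][y] proportional to exp(-dG[x][y]), normalized over x."""
    size = len(delta_g)
    z = [sum(math.exp(-delta_g[x][y]) for x in range(size)) for y in range(size)]
    return [[math.exp(-delta_g[x][y]) / z[y] for y in range(size)] for x in range(size)]

def stationary_distribution(p, iterations=5000):
    """Solve P mu = mu (Eq. 18) by repeated application of P to a uniform start vector."""
    size = len(p)
    mu = [1.0 / size] * size
    for _ in range(iterations):
        mu = [sum(p[x][y] * mu[y] for y in range(size)) for x in range(size)]
    return mu

def entropy_rate_k(p, mu):
    """Entropy per nucleotide of Eq. (19), in units of k."""
    size = len(p)
    return -sum(mu[y] * p[x][y] * math.log(p[x][y])
                for x in range(size) for y in range(size))

P = transition_matrix(DELTA_G)
MU = stationary_distribution(P)
S = entropy_rate_k(P, MU)
print("mu =", [round(m, 4) for m in MU], " s(X) =", round(S, 3), "k/nt")
```

Power iteration is used here only for brevity; any eigenvector solver for the eigenvalue 1 of $P$ would serve the same purpose.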



Appendix C. Partition function vs Markov chain 

The state, $\nu$, of the system is specified by a sequence of nucleotides $x_1, \ldots, x_n$ replicated in the direction
5' to 3' according to a template composed of an ordered sequence of nucleotides $y_1, \ldots, y_n$ polymerized
from the 3'-end to the 5'-end (see Fig. 1), as denoted by:

$$
\nu = \{x_1, x_2, \ldots, x_i, \ldots, x_{n-1}, x_n \| \mathbf{y}\}, \tag{20}
$$

where $\mathbf{y} = (y_1, y_2, \ldots, y_i, \ldots, y_{n-1}, y_n)$. $x_i$ are variables and $y_i$ parameters such that
$x_i, y_i \in X = \{A, C, G, T\}$. The energy of a state is:

$$
E_\nu = E(x_1, \ldots, x_n \| \mathbf{y}) = \sum_{i=1}^{n} E(x_i | x_{i-1}, \ldots, x_1 \| \mathbf{y}^{(i)}), \tag{21}
$$

where $\mathbf{y}^{(i)} = (y_i | y_{i-1}, \ldots, y_1)$ and $E(x_i | x_{i-1}, \ldots, x_1 \| \mathbf{y}^{(i)})$ is the energy of
pairing nucleotide $x_i$ on $y_i$ provided that nucleotides $(x_1, \ldots, x_{i-1})$ are already hybridized on
nucleotides $(y_1, \ldots, y_{i-1})$, respectively. The probability of a state is:



$$
\begin{aligned}
P_\nu &= \Pr\{X_1 = x_1, \ldots, X_n = x_n \| \mathbf{Y}\} = p(x_1, \ldots, x_n \| \mathbf{y}) \\
      &= p(x_1 \| y_1)\, p(x_2 | x_1 \| y_2 | y_1) \cdots p(x_n | x_{n-1}, \ldots, x_1 \| \mathbf{y}),
\end{aligned} \tag{22}
$$

where the last part of the equation is the general expansion of the joint probability as a product of
conditional probabilities [38]. The mean energy or internal energy of the system is:



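$$
\langle E \rangle = \sum_{\nu=1}^{N} P_\nu E_\nu
= \sum_{x_1, \ldots, x_n} p(x_1, \ldots, x_n \| \mathbf{y})\, E(x_1, \ldots, x_n \| \mathbf{y}), \tag{23}
$$

where $N$ is the number of microstates. The partition function is:

$$
Z(\beta, n) = \sum_{\nu=1}^{N} \exp(-\beta E_\nu)
= \sum_{x_1, \ldots, x_n} \exp\left(-\beta \sum_{i=1}^{n} E(x_i | x_{i-1}, \ldots, x_1 \| \mathbf{y}^{(i)})\right), \tag{24}
$$

where $\beta = 1/kT$. The probability of a configuration is thus: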



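$$
\begin{aligned}
P_\nu &= Z^{-1}(\beta, n) \exp(-\beta E_\nu) \\
      &= \frac{\exp\left(-\beta \sum_{i=1}^{n} E(x_i | x_{i-1}, \ldots, x_1 \| \mathbf{y}^{(i)})\right)}
              {\sum_{x'_1, \ldots, x'_n} \exp\left(-\beta \sum_{i=1}^{n} E(x'_i | x'_{i-1}, \ldots, x'_1 \| \mathbf{y}^{(i)})\right)} \\
      &= \frac{\prod_{i=1}^{n} \exp\left(-\beta E(x_i | x_{i-1}, \ldots, x_1 \| \mathbf{y}^{(i)})\right)}
              {\sum_{x'_1, \ldots, x'_n} \prod_{i=1}^{n} \exp\left(-\beta E(x'_i | x'_{i-1}, \ldots, x'_1 \| \mathbf{y}^{(i)})\right)}.
\end{aligned} \tag{25}
$$

It is important to note that the sums in the denominator over $x'_1, \ldots, x'_n$ are nested and therefore they
cannot be factorized as independent sums. In other words, Eq. (25) expands as: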



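$$
\begin{aligned}
&\frac{e^{-\beta E(x_1 \| y_1)}\, e^{-\beta E(x_2 | x_1 \| y_2 | y_1)} \cdots e^{-\beta E(x_n | x_{n-1}, \ldots, x_1 \| \mathbf{y})}}
      {\sum_{x'_1, \ldots, x'_n} e^{-\beta E(x'_1 \| y_1)}\, e^{-\beta E(x'_2 | x'_1 \| y_2 | y_1)} \cdots e^{-\beta E(x'_n | x'_{n-1}, \ldots, x'_1 \| \mathbf{y})}} \\
&\quad = \frac{e^{-\beta E(x_1 \| y_1)}\, e^{-\beta E(x_2 | x_1 \| y_2 | y_1)} \cdots e^{-\beta E(x_n | x_{n-1}, \ldots, x_1 \| \mathbf{y})}}
      {\sum_{x'_1} e^{-\beta E(x'_1 \| y_1)} \left[ \sum_{x'_2} e^{-\beta E(x'_2 | x'_1 \| y_2 | y_1)} \cdots
       \left[ \sum_{x'_n} e^{-\beta E(x'_n | x'_{n-1}, \ldots, x'_1 \| \mathbf{y})} \right] \right]}.
\end{aligned} \tag{26}
$$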



The general term of the expansion of $p(x_1, \ldots, x_n \| \mathbf{y})$ as a product of conditional probabilities (see Eq. (22))
is:

$$
p(x_i | x_{i-1}, \ldots, x_1 \| \mathbf{y}^{(i)}) =
\frac{e^{-\beta E(x_i | x_{i-1}, \ldots, x_1 \| \mathbf{y}^{(i)})}\, f(x_1, \ldots, x_{i-1}, x_i \| \mathbf{y})}
     {\sum_{x'_i} e^{-\beta E(x'_i | x_{i-1}, \ldots, x_1 \| \mathbf{y}^{(i)})}\, f(x_1, \ldots, x_{i-1}, x'_i \| \mathbf{y})}, \tag{27}
$$

where

$$
f(x_1, \ldots, x_i \| \mathbf{y}) = \sum_{x_{i+1}, \ldots, x_n}
e^{-\beta E(x_{i+1} | x_i, \ldots, x_1 \| \mathbf{y}^{(i+1)})} \cdots e^{-\beta E(x_n | x_{n-1}, \ldots, x_1 \| \mathbf{y})}, \tag{28}
$$

fulfilling $\sum_{x_i} p(x_i | x_{i-1}, \ldots, x_1 \| \mathbf{y}^{(i)}) = 1$. For the last nucleotide in the chain, $i = n$, it follows that
$f(x_1, \ldots, x_n \| \mathbf{y}) = 1$ and

$$
p(x_n | x_{n-1}, \ldots, x_1 \| \mathbf{y}) =
\frac{e^{-\beta E(x_n | x_{n-1}, \ldots, x_1 \| \mathbf{y})}}
     {\sum_{x'_n} e^{-\beta E(x'_n | x_{n-1}, \ldots, x_1 \| \mathbf{y})}}; \tag{29}
$$

but, in general, $f(x_1, \ldots, x_i \| \mathbf{y}) \neq 1$ and, therefore, the conditional probabilities depend on the
index $i$, i.e. the position in the polymer chain.

The partition function calculation represents an Ising mechanism in which the nucleotides are placed
arbitrarily: neither order nor one-by-one basis is implied in this replication procedure (Fig. 5A). Unlike
this calculation, the Markov chain calculation implies a Turing machine mechanism in which nucleotides
are placed on a one-after-one basis in the 3' to 5' template direction (Fig. 5B). This mechanism constrains
the sequence in which the different configurations are accessible and therefore, as shown in the main text,
the absolute entropy is higher than that from the Ising mechanism. The probability of placing nucleotide
$x_i$ onto $y_i$ within this scheme is given by:

$$
p(x_i | x_{i-1}, \ldots, x_1 \| \mathbf{y}^{(i)}) =
\frac{e^{-\beta E(x_i | x_{i-1}, \ldots, x_1 \| \mathbf{y}^{(i)})}}
     {\sum_{x'_i} e^{-\beta E(x'_i | x_{i-1}, \ldots, x_1 \| \mathbf{y}^{(i)})}}. \tag{30}
$$

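Hence, the probability of a configuration $\nu$ is given by:

$$
\begin{aligned}
&p(x_1 \| y_1)\, p(x_2 | x_1 \| y_2 | y_1) \cdots p(x_n | x_{n-1}, \ldots, x_1 \| \mathbf{y}) \\
&\quad = \left[ \frac{e^{-\beta E(x_1 \| y_1)}}{\sum_{x'_1} e^{-\beta E(x'_1 \| y_1)}} \right]
         \left[ \frac{e^{-\beta E(x_2 | x_1 \| y_2 | y_1)}}{\sum_{x'_2} e^{-\beta E(x'_2 | x_1 \| y_2 | y_1)}} \right] \cdots
         \left[ \frac{e^{-\beta E(x_n | x_{n-1}, \ldots, x_1 \| \mathbf{y})}}{\sum_{x'_n} e^{-\beta E(x'_n | x_{n-1}, \ldots, x_1 \| \mathbf{y})}} \right],
\end{aligned} \tag{31}
$$

where we have used brackets to stress that the factorization of the probability here implies that the sums
are independent, unlike in the partition function calculation, Eq. (26), for which the sum could not be
factorized. Note that Eq. (31) also fulfills $\sum_{x_1, \ldots, x_n} p(x_1, \ldots, x_n \| \mathbf{y}) = 1$, which is a direct
consequence of Eq. (30).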
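
The difference between the nested normalization of Eq. (26) and the factorized normalization of Eq. (31) can be made concrete on a short template by brute-force enumeration. The sketch below is an illustration only: the nearest-neighbor energy function `step_energy` is a hypothetical toy model in kT units, not the empirical data of [12], and the four-nucleotide template is arbitrary.

```python
import itertools
import math

ALPHABET = "ACGT"
WC_PARTNER = {"A": "T", "C": "G", "G": "C", "T": "A"}

def step_energy(x, x_prev, y, y_prev):
    """Toy energy E(x_i|x_{i-1}||y_i|y_{i-1}) in kT units (hypothetical values)."""
    correct = x == WC_PARTNER[y]
    pairing = -2.0 if correct else 1.0
    if x_prev is None:
        return pairing
    # Stacking is favorable only when a correct pair stacks on a correct pair.
    stacking = -1.0 if (correct and x_prev == WC_PARTNER[y_prev]) else 0.5
    return pairing + stacking

def step_energies(copy, template):
    """List of the n step energies of a configuration, in order of incorporation."""
    return [step_energy(x, copy[i - 1] if i else None, y, template[i - 1] if i else None)
            for i, (x, y) in enumerate(zip(copy, template))]

def ising_distribution(template):
    """Eq. (25): Boltzmann weight of the total energy, normalized over all configurations."""
    states = list(itertools.product(ALPHABET, repeat=len(template)))
    weights = [math.exp(-sum(step_energies(s, template))) for s in states]
    z = sum(weights)
    return {s: w / z for s, w in zip(states, weights)}

def turing_distribution(template):
    """Eq. (31): product of step probabilities, each normalized independently (Eq. 30)."""
    dist = {}
    for state in itertools.product(ALPHABET, repeat=len(template)):
        prob = 1.0
        for i, (x, y) in enumerate(zip(state, template)):
            x_prev = state[i - 1] if i else None
            y_prev = template[i - 1] if i else None
            z_step = sum(math.exp(-step_energy(c, x_prev, y, y_prev)) for c in ALPHABET)
            prob *= math.exp(-step_energy(x, x_prev, y, y_prev)) / z_step
        dist[state] = prob
    return dist

def entropy_k(dist):
    """Shannon entropy of a distribution over configurations, in units of k."""
    return -sum(p * math.log(p) for p in dist.values() if p > 0.0)

template = "ACGT"
ising = ising_distribution(template)
turing = turing_distribution(template)
print("S_Ising  =", round(entropy_k(ising), 4), "k")
print("S_Turing =", round(entropy_k(turing), 4), "k")
```

Both distributions are normalized, but they differ because the Ising weights account for the effect of each nucleotide on later steps, whereas the Turing step probabilities do not look ahead. For this toy model the Ising entropy is the lower of the two, in line with the trend described in the main text.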

As explained in the main text, hybridization energies are only dependent on the previously formed
base-pair, that is, $l = 1$. If boundary changes, such as nucleotide concentration or temperature and
ionic gradients along the DNA polymer, are not present, the energies will only depend on the position,
$i$, on the template through the values of $y_i$. This means that there is a single set of energies
$\Delta G^{x_0,x}_{y_0,y}$, where $x, x_0, y, y_0 \in X = \{A, C, G, T\}$, and parameters $x_0$ and $y_0$ refer to nucleotides
preceding $x$ and $y$, respectively. $\Delta G^{x_0,x}_{y_0,y}$ are the hybridization energy data from [12], also used in
Appendix A. Note that the fact that the hybridization energies do not depend on $i$ implies that there is also
a single set of probabilities for the Turing mechanism:

$$
p(x | x_0 \| y | y_0) = \frac{1}{Z(x_0, y, y_0)} \exp\left( \frac{-\Delta G^{x_0,x}_{y_0,y}}{kT} \right), \tag{32}
$$

$$
Z(x_0, y, y_0) = \sum_{x \in X} \exp\left( \frac{-\Delta G^{x_0,x}_{y_0,y}}{kT} \right). \tag{33}
$$

The dependence on the salt concentration cancels out in this calculation, as was shown for the 0th-order
approximation (see Eqs. (7) and (8)), so these probability distributions are only temperature-dependent
within the empirical expressions given by [12].

For the Ising mechanism, however, the probabilities are affected not only by the sequence parameters
$y_i$ but also by the length, $n$, of the template, and therefore, there is an implicit dependence on $i$ (compare
Eqs. (10) and (12) in the main text or Eqs. (27) and (30)).



Appendix D. Monte Carlo simulations

Simulations of the internal energy and the probability of error for both the Ising and Turing mechanisms
have been generated for template sequences of 20 nucleotides. For the former, the partition function
is calculated by using the classical Metropolis Monte Carlo procedure [39]. For the Turing mechanism,
each decision step is ruled by the probabilities $p(x_i | x_{i-1} \| y_i | y_{i-1})$, as established by Eq. (12) in the main
text. In more detail, a symbol $x_i \in X = \{A, C, G, T\}$ and a number $r \in [0, 1]$ are chosen at random at
step $i$. Then, considering the outcome of the previous step $i - 1$ and the template symbol $y_i$, the symbol $x_i$
is incorporated at position $i$ if it fulfills the condition $r < p(x_i | x_{i-1} \| y_i | y_{i-1})$.

For both kinds of calculation, averages $\langle E \rangle$ are taken over $10^7$ iterations, i.e. over $10^7$ sequences
generated with each procedure. Comparison of the exact values shown in Fig. 3 with the ones simulated
here shows a discrepancy within 3.6%, thus confirming a rapid convergence to the thermodynamic limit.
The calculations for independent, non-identically distributed variables show discrepancies from the exact
calculations in the fifth significant figure for the simulations based on the above procedure, which confirms
the validity of the above explained algorithm.
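
The decision rule described above for the Turing mechanism amounts to an accept/reject step: a candidate symbol and a random number $r$ are drawn, and the candidate is incorporated when $r$ is below its probability. A minimal sketch follows. It assumes that a rejected candidate is simply redrawn until one is accepted, and it uses a hypothetical stand-in `step_probability` (0.91 for the WC partner, 0.03 otherwise) instead of the empirical probabilities of Eq. (12); the sample size is also much smaller than the $10^7$ sequences used in the text.

```python
import random

ALPHABET = "ACGT"
WC_PARTNER = {"A": "T", "C": "G", "G": "C", "T": "A"}

def step_probability(x, x_prev, y, y_prev):
    """Hypothetical stand-in for p(x_i|x_{i-1}||y_i|y_{i-1}); ignores the previous pair."""
    return 0.91 if x == WC_PARTNER[y] else 0.03

def sample_turing_copy(template, rng):
    """Build a copy one nucleotide at a time with the accept/reject rule r < p."""
    copy = []
    for i, y in enumerate(template):
        x_prev = copy[-1] if copy else None
        y_prev = template[i - 1] if i > 0 else None
        while True:
            candidate = rng.choice(ALPHABET)   # symbol chosen at random
            r = rng.random()                   # number chosen at random in [0, 1)
            if r < step_probability(candidate, x_prev, y, y_prev):
                copy.append(candidate)
                break
    return "".join(copy)

def error_fraction(copy, template):
    """Fraction of positions holding a non-WC union."""
    return sum(x != WC_PARTNER[y] for x, y in zip(copy, template)) / len(template)

rng = random.Random(0)
template = "ACGTACGTACGTACGTACGT"  # 20 nucleotides, as in the simulations described above
samples = [sample_turing_copy(template, rng) for _ in range(2000)]
mean_error = sum(error_fraction(s, template) for s in samples) / len(samples)
print("average error fraction:", round(mean_error, 3))
```

Because a uniformly drawn candidate is accepted with probability equal to its own incorporation probability, the accepted symbols follow that probability distribution, so the average error fraction approaches the total mismatch probability of the stand-in model.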



Appendix E. Informational and Configurational terms of the 
Entropy in an ideal gas 

The entropy per particle of a multicomponent ideal gas can be calculated by using the Sackur-Tetrode
formula for a mixture of $n$ particles of types $x \in X = \{A, C, G, T\}$ and masses $m_x$ occupying a total
volume $V$ as follows:

$$
\begin{aligned}
\frac{S}{n} &= k \sum_{x \in X} \frac{n_x}{n} \left[ \frac{5}{2} + \ln\left( \frac{V}{n_x} \left( \frac{2\pi m_x kT}{h^2} \right)^{3/2} \right) \right] \\
            &= -k \sum_{x \in X} p_n(x) \ln p_n(x) - k \sum_{x \in X} p_n(x) \ln\left[ \frac{n}{V} \frac{h^3}{(2\pi m_x kT)^{3/2}} \right] + \frac{5}{2} k \\
            &= s_I(n_x/n) + s_c(x, C) + s_0,
\end{aligned} \tag{34}
$$

where $p_n(x) = n_x/n$, as defined in the main text, $s_0 = (5/2) k$ is a constant,
$s_I(n_x/n) = -k \sum_{x \in X} p_n(x) \ln p_n(x)$ is what we label here as the informational entropy of the initial state and
$s_c(x, C)$ is what we label here as the configurational entropy,

$$
s_c(x, C) = -k \sum_{x \in X} p_n(x) \ln\left( C \Lambda_l^3(x) \right), \tag{35}
$$

which depends on the type of particle, $x$, and the total concentration of particles, $C = n/V$. $\Lambda_l(x)$ is a
generalized thermal wavelength for particles in a liquid, introduced here as:

$$
\Lambda_l(x) = \frac{1}{n_f(x)} \frac{h}{\sqrt{2\pi m_x kT}}, \tag{36}
$$

with $n_f(x)$ a thermal refractive index of particles, $x$, in a liquid, which becomes $n_f^{vac} = 1$ for particles
in vacuum, since this is the case of the classical ideal gas. In this scheme, the thermal refractive index
modifies the thermal wavelength of classical particles in a liquid as $\Lambda_l = \Lambda/n_f$, and the dispersion relation
as $E = (n_f)^{-2} p^2/2m$.

For particles of similar mass, chemical composition and structure, as is the case for nucleotides, we
can assume that both $\Lambda_l$ and $n_f$ are the same for the four types of particles. In these conditions, the
configurational entropy is only a function of the total concentration of nucleotides, $s_c(C) = -k \ln(C \Lambda_l^3)$.
The molar concentration of nucleotides in a reservoir for single-molecule experiments and for in vivo
replication is about 50 μM. The mass of the deoxyribonucleotide monophosphates is $m_x \sim 330$
Da. Then, the thermal wavelength is $\Lambda \sim 10^{-12}$ m and the configurational entropy is $s_c \approx 26$ k/nt,
which is much larger than the informational entropy, $s_I = 1.39$ k/nt (see the main text).
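
The numbers quoted above can be checked with the expressions $\Lambda = h/\sqrt{2\pi m kT}$ and $s_c = -k \ln(C \Lambda^3)$. The sketch below is an order-of-magnitude check under assumptions made here: a temperature of 37°C (310.15 K), a thermal refractive index $n_f = 1$, and CODATA values for the physical constants.

```python
import math

PLANCK = 6.62607015e-34        # J s
BOLTZMANN = 1.380649e-23       # J/K
AVOGADRO = 6.02214076e23       # 1/mol
DALTON = 1.66053906660e-27     # kg

def thermal_wavelength(mass_kg, temperature_k, refractive_index=1.0):
    """Lambda_l = (1/n_f) * h / sqrt(2 pi m k T), in metres."""
    return PLANCK / (refractive_index * math.sqrt(2.0 * math.pi * mass_kg * BOLTZMANN * temperature_k))

def configurational_entropy_k(molar_concentration, wavelength_m):
    """s_c = -ln(C * Lambda^3) in units of k, with C converted from mol/L to 1/m^3."""
    number_density = molar_concentration * 1000.0 * AVOGADRO
    return -math.log(number_density * wavelength_m ** 3)

# Assumed conditions: m ~ 330 Da, T = 310.15 K (37 C), C = 50 micromolar, n_f = 1.
wavelength = thermal_wavelength(330.0 * DALTON, 310.15)
s_c = configurational_entropy_k(50e-6, wavelength)
print(f"Lambda ~ {wavelength:.2e} m,  s_c ~ {s_c:.1f} k/nt")
```

With these inputs the wavelength is a few picometres and the configurational entropy is close to 26 k/nt, far above the informational term $k \ln 4 \approx 1.39$ k/nt.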

Figure 1. Sketch of DNA replication. (A) Variables, x, parameters, y, and sequence position i are
represented together with 5' and 3' ends on both template and replicated strands. (B) Polymerase
(green) replicating a template DNA strand. Emphasis is placed on the linearity and directionality of the
process. l is the number of nucleotides that this nanomachine covers when it is bound to the DNA. (C)
Polymerase replicating DNA. Emphasis is placed on the recognition of errors based on the
sequence-induced secondary structure of the DNA: the structure of DNA polymerase (a 'palm', [7]) and
dsDNA (a double helix) are evolutionarily adapted to optimally fit each other when Watson-Crick
base-pairs are formed. A complete turn of the double helix in B-form is l ~ 10 base-pairs.

Figure 2. Entropy of replication in equilibrium. Each panel shows the entropy per nucleotide in
the absence (solid line) and presence of nearest neighbor influence for an Ising mechanism (dashed lines)
and for a Turing mechanism (dotted lines). (A) Monotonous template sequences: black lines, polydA;
red lines, polydC; green lines, polydG; and blue lines, polydT. (B) Black lines, periodic template
sequence: ACGTACGTACGTA...; red lines, random template sequence TCCGAGTAGATCT...

Figure 3. Internal Energy of replication in equilibrium. Each panel shows the mean energy per
nucleotide in the absence (solid line) and presence of nearest neighbor influence for an Ising mechanism
(dashed lines) and for a Turing mechanism (dotted lines). (A) Monotonous template sequences: black
lines, polydA; red lines, polydC; green lines, polydG; and blue lines, polydT. (B) Black lines, periodic
template sequence: ACGTACGTACGTA...; red lines, random template sequence TCCGAGTAGATCT...

Figure 4. Helmholtz Free Energy of replication in equilibrium. Each panel shows the free
energy per nucleotide in the absence (solid line) and presence of nearest neighbor influence for an Ising
mechanism (dashed lines) and for a Turing mechanism (dotted lines). (A) Monotonous template
sequences: black lines, polydA; red lines, polydC; green lines, polydG; and blue lines, polydT. (B)
Black lines, periodic template sequence: ACGTACGTACGTA...; red lines, random template sequence
TCCGAGTAGATCT...

Figure 5. Thermodynamic analysis of the Entropy. (A) Ising Mechanism: Nucleotides are
branched on the template strand without constraints of order, direction of replication or number of
nucleotides placed at a time. The calculation is based on the partition function formalism. (B) Turing
Mechanism: Nucleotides are branched from the 3'-end to the 5'-end of the template strand on a
directional one-after-one basis. The calculation is based on the Markov chain formalism.



